Combining Models

Introduction

committees

Train \(L\) different models and then make predictions using the average of the predictions made by each model.

boosting

Train multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models.

decision trees

Different models are responsible for making predictions in different regions of input space.

mixtures of experts

Models are viewed as mixture distributions in which the component densities, as well as the mixing coefficients, are conditioned on the input variables. \[\begin{aligned} p(t|\vec{x})=\sum\limits_{k=1}^{K}\pi_k(\vec{x})p(t|\vec{x},k) \end{aligned}\] in which \(\pi_k(\vec{x})=p(k|\vec{x})\) represents the input-dependent mixing coefficients, and \(k\) indexes the model.
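
As an illustration of this last formula, here is a minimal Python sketch that evaluates \(p(t|\vec{x})\) at a single input, assuming Gaussian experts with linear means and a softmax gating network; all function and variable names are illustrative rather than taken from the text.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)
        e = np.exp(z)
        return e / e.sum()

    def moe_predictive_density(x, t, gate_W, expert_w, expert_var):
        # pi_k(x) = p(k|x): input-dependent mixing coefficients from a softmax gating network
        pi = softmax(gate_W @ x)
        # each expert k is assumed Gaussian: p(t|x,k) = N(t | w_k^T x, var_k)
        means = expert_w @ x
        dens = np.exp(-0.5 * (t - means) ** 2 / expert_var) / np.sqrt(2 * np.pi * expert_var)
        # p(t|x) = sum_k pi_k(x) p(t|x,k)
        return float(np.sum(pi * dens))

For example, with \(K=2\) experts in \(D=3\) dimensions, gate_W and expert_w would have shape (2, 3) and expert_var shape (2,).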

Bayesian Model Averaging

In Bayesian model averaging, the whole data set is assumed to be generated by a single model, and the distribution over models simply expresses our uncertainty about which one it is. By contrast, when we combine multiple models, different data points within the data set can potentially be generated from different values of the latent variable \(\vec{z}\) and hence by different components.
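
The distinction can be written out explicitly. With candidate models indexed by \(h\) (notation introduced here for illustration), Bayesian model averaging over a data set \(X=\{\vec{x}_1,...,\vec{x}_N\}\) gives \[\begin{aligned} p(X)=\sum_{h}p(X|h)p(h)\end{aligned}\] so that a single value of \(h\) is responsible for generating the whole data set, whereas a mixture model introduces a latent variable for every observation separately, \[\begin{aligned} p(X)=\prod_{n=1}^{N}p(\vec{x}_n)=\prod_{n=1}^{N}\sum_{\vec{z}_n}p(\vec{x}_n,\vec{z}_n)\end{aligned}\]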

Committees

The simplest way to construct a committee is to average the predictions of a set of individual models. Such averaging can cancel much of the contribution to the error arising from the variance of the individual models, although it does not reduce the bias.

One way to introduce variability between the different models within the committee is to train each of them on a different bootstrap sample of the data. Suppose we generate \(M\) bootstrap data sets and use each to train a separate copy \(y_m(\vec{x})\) of a predictive model, where \(m=1,...,M\). The committee prediction is then \[\begin{aligned} y_{COM}(\vec{x})=\dfrac{1}{M}\sum\limits_{m=1}^{M}y_m(\vec{x})\end{aligned}\] This procedure is known as bootstrap aggregation or bagging.
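
A minimal sketch of bagging in Python, assuming each model object follows the common fit/predict convention with fit returning the fitted model; the function names are illustrative.

    import numpy as np

    def bagging_fit(X, t, make_model, M, seed=0):
        """Train M copies of a model, each on a bootstrap resample of (X, t)."""
        rng = np.random.default_rng(seed)
        N = len(X)
        models = []
        for _ in range(M):
            idx = rng.integers(0, N, size=N)      # draw N indices with replacement
            models.append(make_model().fit(X[idx], t[idx]))
        return models

    def bagging_predict(models, X):
        """Committee prediction y_COM(x): the average of the individual predictions."""
        return np.mean([m.predict(X) for m in models], axis=0)

For example, with scikit-learn one could pass make_model=lambda: DecisionTreeRegressor(max_depth=3) to bag shallow regression trees.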

Boosting

Here we describe the most widely used form of boosting algorithm: AdaBoost, short for 'adaptive boosting'. The base classifiers are known as weak learners and are trained in sequence using a weighted form of the data set, in which the weighting coefficient associated with each data point depends on the performance of the previous classifiers.

Consider a two-class classification problem in which the training data comprises input vectors \(\vec{x}_1,...,\vec{x}_N\) along with corresponding binary target variables \(t_1,...,t_N\), where \(t_n\in \{-1,1\}\). Each data point is given an associated weighting coefficient \(w_n\), initially set to \(1/N\) for every point. At each stage a base classifier \(y(\vec{x})\in \{-1,1\}\) is trained on this weighted data, as summarized in the algorithm below.

AdaBoost

1. Initialize the data weighting coefficients \(\{w_n\}\) by setting \(w_n^{(1)}=1/N\) for \(n=1,...,N\).

2. For \(m=1,...,M\):

   (a) Fit a classifier \(y_m(\vec{x})\) to the training data by minimizing the weighted error function \[\begin{aligned} J_m=\sum_{n=1}^{N}w_n^{(m)}I(y_m(\vec{x}_n)\neq t_n)\end{aligned}\] where \(I(\cdot)\) denotes the indicator function.

   (b) Evaluate \[\begin{aligned} \epsilon_m=\dfrac{\sum_{n=1}^{N}w_n^{(m)}I(y_m(\vec{x}_n)\neq t_n)}{\sum_{n=1}^{N}w_n^{(m)}}\end{aligned}\] and then set \(\alpha_m=\ln\{(1-\epsilon_m)/\epsilon_m\}\).

   (c) Update the data weighting coefficients \[\begin{aligned} w_n^{(m+1)}=w_n^{(m)}\exp\{\alpha_m I(y_m(\vec{x}_n)\neq t_n)\}\end{aligned}\]

3. Make predictions using the final model, which is given by \[\begin{aligned} Y_M(\vec{x})=\mathrm{sign}\Big(\sum_{m=1}^{M}\alpha_m y_m(\vec{x})\Big) \end{aligned}\]
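
A minimal sketch of this procedure in Python, using decision stumps as the weak learners; the stump fitting below is a simple illustrative implementation, not taken from the text.

    import numpy as np

    def fit_stump(X, t, w):
        """Pick the single-feature threshold rule minimizing the weighted error J_m."""
        best = None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = np.sum(w * (pred != t))
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        return best[1:]                                  # (feature, threshold, sign)

    def stump_predict(stump, X):
        j, thr, sign = stump
        return sign * np.where(X[:, j] > thr, 1, -1)

    def adaboost_fit(X, t, M):
        N = len(t)
        w = np.full(N, 1.0 / N)                          # step 1: uniform initial weights
        stumps, alphas = [], []
        for _ in range(M):
            stump = fit_stump(X, t, w)                   # step 2(a): minimize weighted error
            miss = (stump_predict(stump, X) != t)
            eps = np.sum(w * miss) / np.sum(w)           # step 2(b): weighted error rate
            alpha = np.log((1 - eps) / (eps + 1e-12))
            w = w * np.exp(alpha * miss)                 # step 2(c): re-weight misclassified points
            stumps.append(stump); alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        scores = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
        return np.sign(scores)                           # step 3: sign of the weighted combination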

Minimizing exponential error

Consider the exponential error function defined by \[\begin{aligned} E=\sum_{n=1}^{N}\exp\{-t_n f_m(\vec{x}_n)\}\end{aligned}\] where \(f_m(\vec{x})\) is a classifier defined in terms of a linear combination of base classifiers \(y_l(\vec{x})\) of the form \[\begin{aligned} f_m(\vec{x})=\dfrac{1}{2}\sum_{l=1}^{m}\alpha_l y_l(\vec{x})\end{aligned}\] and \(t_n\in \{-1,1\}\) are the training set target values. Our goal is to minimize \(E\) with respect to the weighting coefficients \(\alpha_l\) and the parameters of the base classifiers \(y_l(\vec{x})\).

Separating off the contribution from base classifier \(y_m(\vec{x})\), \[\begin{aligned} E&=\sum_{n=1}^{N}\exp\{-t_n f_m(\vec{x}_n)\} \\ &=\sum_{n=1}^{N}\exp\Big\{-\dfrac{1}{2}t_n\sum_{l=1}^{m}\alpha_l y_l(\vec{x}_n)\Big\} \\ &=\sum_{n=1}^{N}\exp\Big\{-t_n f_{m-1}(\vec{x}_n)-\dfrac{1}{2}t_n \alpha_m y_m(\vec{x}_n)\Big\} \\ &=\sum_{n=1}^{N}w_n^{(m)}\exp\Big\{-\dfrac{1}{2}t_n \alpha_m y_m(\vec{x}_n)\Big\}\end{aligned}\] where the coefficients \(w_n^{(m)}=\exp\{-t_n f_{m-1}(\vec{x}_n)\}\) can be viewed as constants because we are optimizing only \(\alpha_m\) and \(y_m(\vec{x})\). Denoting by \(\mathcal{T}_m\) the set of data points correctly classified by \(y_m(\vec{x})\) and by \(\mathcal{M}_m\) the set of misclassified points, we can rewrite the error function as \[\begin{aligned} E &=e^{-\alpha_m/2}\sum_{n\in\mathcal{T}_m}w_n^{(m)}+e^{\alpha_m/2}\sum_{n\in\mathcal{M}_m}w_n^{(m)} \\ &=(e^{\alpha_m/2}-e^{-\alpha_m/2})\sum_{n=1}^{N}w_n^{(m)}I(y_m(\vec{x}_n)\neq t_n)+e^{-\alpha_m/2}\sum_{n=1}^{N}w_n^{(m)}\end{aligned}\] Only the first term depends on the classifier, so minimizing \(E\) with respect to \(y_m(\vec{x})\) is equivalent to minimizing the weighted error \(\sum_{n}w_n^{(m)}I(y_m(\vec{x}_n)\neq t_n)\); minimizing with respect to \(\alpha_m\) then gives \(\alpha_m=\ln\{(1-\epsilon_m)/\epsilon_m\}\), as used in the algorithm above.

To obtain the weight update, note that because \(w_n^{(m)}=\exp\{-t_n f_{m-1}(\vec{x}_n)\}\), the weights at the next iteration are \(w_n^{(m+1)}=\exp\{-t_n f_{m}(\vec{x}_n)\}\), and hence \[\begin{aligned} w_n^{(m+1)}=w_n^{(m)}\exp\Big\{-\dfrac{1}{2}t_n\alpha_m y_m(\vec{x}_n)\Big\}\end{aligned}\] Making use of the fact that \[\begin{aligned} t_n y_m(\vec{x}_n)=1-2I(y_m(\vec{x}_n)\neq t_n)\end{aligned}\] we see that the weights at the next iteration are updated according to \[\begin{aligned} w_n^{(m+1)}=w_n^{(m)}\exp(-\alpha_m/2)\exp\{\alpha_m I(y_m(\vec{x}_n)\neq t_n)\}\end{aligned}\] Because the term \(\exp(-\alpha_m/2)\) is independent of \(n\), it multiplies every weight by the same factor and can therefore be discarded, which recovers the update used in the algorithm.
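
To make the minimization over \(\alpha_m\) explicit, write \(A=\sum_{n}w_n^{(m)}I(y_m(\vec{x}_n)\neq t_n)\) and \(B=\sum_{n}w_n^{(m)}\) (symbols introduced here for convenience), so that \(\epsilon_m=A/B\). Setting the derivative of \(E\) to zero gives \[\begin{aligned} \dfrac{\partial E}{\partial \alpha_m}&=\dfrac{1}{2}\left(e^{\alpha_m/2}+e^{-\alpha_m/2}\right)A-\dfrac{1}{2}e^{-\alpha_m/2}B=0 \\ \Rightarrow\quad e^{\alpha_m}&=\dfrac{B-A}{A} \\ \Rightarrow\quad \alpha_m&=\ln\dfrac{1-\epsilon_m}{\epsilon_m}\end{aligned}\]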

Error functions for boosting

Tree-based Models

Conditional Mixture Models